{% extends "start.html" %}
{% block content %}

<div id="main">
  <div id="leftbar">
  <div id="search">
  <ol>
  <li><a href="#1_What_is_AnnoLex">What is AnnoLex?</a></li>
  <li><a href="#2_How_to_use_AnnoLex">How to use AnnoLex for second-phase curation</a></li>
  <li><a href="#3_How_Does_AnnoLex_Work">How Does AnnoLex Work?</a></li>
  <li><a href="#4_The_Edit_Panel">The Edit Panel</a></li>
  <li><a href="#5_What_AnnoLex_cannot_do">What AnnoLex cannot do</a></li>
  </ol>
  </div>
</div>
<div id="about">
<table class="about">
<tr><td>
<h2 align="center">Shakespeare His Contemporaries:<br>
Collaborative Curation of EEBO-TCP Texts with AnnoLex
</h2>

<p align="center">By Martin Mueller (Northwestern University)</p>

<a name="1_What_is_AnnoLex"></a><h2>1. What is AnnoLex?</h2>

<p>AnnoLex is a collaborative data curation tool for use with
EEBO-TCP texts. It is useful for the identification and correction of incompletely or incorrectly transcribed words.
It can also be used for the manual correction of algorithmically applied
lemmatization and part-of-speech tagging. It is built using the Python-based
Django framework and stores its data in a MySQL database. AnnoLex was
developed by Craig Berry under a grant from Academic
Research Technologies at Northwestern University.</p>

<p>This document
explains how to use AnnoLex for the second-phase curation of "Shakespeare
His Contemporaries," a corpus of approximately 500 non-Shakespearean
plays from 1576 to 1642 that underwent a first-phase curation at the
hands of five Northwestern undergraduates working with me during the
summer of 2013. Nayoon Ahn, Hannah Bredar, Madeline Burg, Nicole
Sheriko, and Melina Yeh worked with AnnoLex and focused on errors that
could be corrected with a high degree of confidence by consulting EEBO
images accessible from within AnnoLex. They fixed ~36,000 of
approximately 56,000 manifest textual errors, which is not bad for a
first-round rough cleanup.  Most of the residual errors
require a look at the original printed page to be fixed with confidence.</p>

<p>This document addresses you as a person who can be persuaded to think of
an EEBO-TCP text as a "collaboratively curatable object" (CCO) and who wants
to contribute to the improvement of this or that text by looking at the
pages of the printed source, preferably the copy in the Rare Book Library
that provided the microfilm source for the digital scan from which the
text was transcribed.  Looking at the printed original will in many
cases make it easy to spot a transcriptional error that could not be
identified from the digital scan. There are 465 plays that still contain
one or more known errors.  Half of them contain fewer than twenty
problems, most of them simple and fixable in an hour or so.</p>

<p>A play that has gone through a second-phase clean-up and from which all
or most known errors have been removed is not a perfect diplomatic
edition, let alone a critical one.  It may contain philological
“cruxes” that stubbornly resist solution. Was it an "Indian" or
"Iudean" that "threw away a pearl richer than all his tribe"?  A lot of
philological good is done if all or most of the mundane and solvable
problems are fixed first, leaving the more interesting problems for
further and possibly endless speculation. In the interim a text with
only its hard problems unsolved will be a text that is good enough for
most readerly purposes.  And a dramatic corpus that has undergone this
level of curation is unlikely to bias or distort corpus-wide inquiries.</p>

<a name="2_How_to_use_AnnoLex"></a><h2>2. How to use AnnoLex for second-phase curation</h2>

<p>The current instance of AnnoLex builds on last summer's work and
contains only the text of pages that still contain one or more known
errors.  A known error is a place in the text where the transcriber
could not identify a letter, word, or passage and marked it as a gap.
AnnoLex uses the following symbols for different types of gaps:</p>

<ol>
<li>The black dot (●) is used for missing letters on 6,687 occasions.</li>
<li>The lozenge (◊) is used for missing words on 4,524 occasions.</li>
<li>The ellipsis (…) is used for a "span" of indeterminate length, but
fewer than three words, on 1,377 occasions.</li>
<li>The black square (■) is used for an ambiguous punctuation mark on 7,318 occasions.</li>
</ol>

<p>The transcribers were conscientious in their counting of different
gaps. Thus a spelling like 'W●●t'  probably  means that the transcriber
accurately counted two missing letters. But it is not certain. Black
dots at the end of a word not infrequently denote punctuation marks that
should have been transcribed as black squares.  The reverse error is
very rare.</p>

<a name="3_How_Does_AnnoLex_Work"></a><h2>3. How Does AnnoLex Work?</h2>

<p>AnnoLex has two major views: <b>Correct</b> and <b>Review</b>.  They are accessible from the top menu
bar.  You may look at either, but
you must be logged in to suggest corrections in the <b>Correct</b> view, and you must have special editorial privileges to
approve corrections in the <b>Review</b> view.  Unless you are a
reviewer, you can ignore the <b>Review</b> view.  A future revision of AnnoLex
may include a feature that lets you review and delete or amend your own
corrections, but this useful feature is not yet available.</p>

<p>A future revision of AnnoLex will also allow you to create your own
user account. For the time being you can only get a user account by
asking me for it via email at
<a href="mailto:martinmueller@northwestern.edu?Subject=AnnoLex%20Account%20Request">
martinmueller@northwestern.edu</a>.</p>

<p>Neither a correction nor its approval changes the underlying
source text.  Think of a correction
as an annotation  attached to a
place in the text and of its approval as an additional annotation about the
status of that correction.  The actual correction of the source texts is
a separate process.  However, if a correction has been approved, AnnoLex
will display the corrected text so that the same error will not be
corrected multiple times. </p>
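<p>The idea of a correction as an annotation layered over an unchanged source
can be illustrated with a short Python sketch. This is not AnnoLex's actual
code: the function name <code>displayed_spelling</code> and the dictionary keys
<code>spelling</code> and <code>approved</code> are hypothetical and chosen
only for illustration.</p>

<pre>
```python
def displayed_spelling(source_spelling, corrections):
    """Return the spelling to display for one token.

    corrections is a list of dicts with the hypothetical keys
    'spelling' (the suggested value) and 'approved' (True or False).
    The source spelling itself is never modified.
    """
    for correction in corrections:
        if correction["approved"]:
            return correction["spelling"]
    # No approved correction yet: keep showing the source text.
    return source_spelling


pending = [{"spelling": "slaunderous", "approved": False}]
approved = [{"spelling": "slaunderous", "approved": True}]

print(displayed_spelling("staunderous", pending))   # staunderous
print(displayed_spelling("staunderous", approved))  # slaunderous
```
</pre>

<p>In this sketch the source string is only read, never overwritten, which
mirrors the separation between suggesting, approving, and applying a
correction.</p>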

<h3>3.1 The Correct View</h3>

<p>In the <b>Correct View</b> your browser window is divided into three
parts. The entire right half is taken up by a display panel. The upper left
part is a search panel in which you define what you are looking for. The lower
left part is an edit panel where you make your suggestions. </p>

<p>The display panel shows you a <b>Spelling in Context</b> with
the spelling highlighted between the left and right context. In separate
columns it shows the spelling, lemma, and POS tag.  The last column
contains an <b>Edit button</b>, which activates the <b>Edit
panel</b>. For the purpose of this second-phase curation, the lemma and
POS values are irrelevant. </p>

<h4>3.1.1 The Search Panel</h4>

<p>The Search panel gives you various options for constraining your
search. There are nine options, which you may combine in any way, including
some that are unlikely to produce useful results. For the purpose of
correcting the residual errors in a given play the two critical options
are <b>Text</b> and <b>Filter</b>. </p>

<p>The <b>Text</b> option has a drop-down menu that lets you choose from plays
listed by author and title. Ignore the <b>All</b> option, which does not return
coherent results with the current data set of AnnoLex.</p>

<p>The <b>Filter</b> option lets you choose between <b>All</b> and
<b>Preselected</b>. The <b>Preselected</b> filter selects all tokens
that contain a black dot, a black square, an ellipsis, or a lozenge. In
other words, it selects all the "tokens of interest" for this curation
phase.  If you correct all of them and your corrections are approved,
the play you curated joins the list of plays with no known errors (which
is not the same thing as a play with no errors).</p>
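<p>What the <b>Preselected</b> filter looks for can be sketched in a few lines
of Python. This is only an illustration of the selection rule described above,
not the query AnnoLex actually runs. The names <code>GAP_SYMBOLS</code>,
<code>gap_types</code>, and <code>is_preselected</code> are hypothetical.</p>

<pre>
```python
# The four gap symbols described in section 2.
GAP_SYMBOLS = {
    "●": "missing letter",
    "◊": "missing word",
    "…": "span of indeterminate length",
    "■": "ambiguous punctuation mark",
}


def gap_types(token):
    """Return the kinds of gap found in a token, in order of first appearance."""
    found = []
    for char in token:
        kind = GAP_SYMBOLS.get(char)
        if kind is not None and kind not in found:
            found.append(kind)
    return found


def is_preselected(token):
    """True if the token contains at least one gap symbol."""
    return bool(gap_types(token))


tokens = ["What", "W●●t", "◊", "slaunderous", "end■"]
print([t for t in tokens if is_preselected(t)])  # ['W●●t', '◊', 'end■']
```
</pre>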

<a name="4_The_Edit_Panel"></a><h2>4. The Edit Panel</h2>

<p>The <b>Edit</b> panel
occupies the lower left part of your browser window.  You must be logged in to save your
corrections. If, after clicking the <b>Edit</b> button, you do not see a
button that says <code>View EEBO Image</code>, you are not logged in.
</p>

<p>If you click on the Edit button on the right edge of the display panel, the top line
of the <b>Edit</b> panel changes and displays  a command of the following
kind:</p>

<p><code>Edit word 17-b-3790  from <i>Jacob and Esau</i></code></p>

<p>&quot;17-b-3790&quot; is a three-part unique identifier where</p>

<ol>
<li>The first part, consisting of one or more digits, is a digital image
identifier and retrieves that image, which is nearly always a double
page.  N.B.  The image number refers to the EEBO image set. It is not
the page number of the printed original.</li>

<li>The second part ('a' or 'b') tells you whether the word is found on the left ('a') or
right ('b') side of a double-page image.</li>

<li>The third part is a word counter incrementing by ten. In this case it identifies the
word as word 379.</li>
</ol>
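<p>The three parts of the identifier can be taken apart mechanically. The
following Python sketch assumes only the format described above. The names
<code>WordId</code>, <code>parse_word_id</code>, and <code>word_ordinal</code>
are hypothetical and are not part of AnnoLex.</p>

<pre>
```python
from typing import NamedTuple


class WordId(NamedTuple):
    image: int    # EEBO image number (not the printed page number)
    side: str     # 'a' = left half of the double-page image, 'b' = right half
    counter: int  # word counter, incrementing by ten


def parse_word_id(word_id: str) -> WordId:
    """Split an identifier such as '17-b-3790' into its three parts."""
    image, side, counter = word_id.split("-")
    if side not in ("a", "b"):
        raise ValueError("side must be 'a' or 'b', got " + repr(side))
    return WordId(int(image), side, int(counter))


def word_ordinal(word_id: WordId) -> int:
    """Counters step by ten, so counter 3790 is word 379."""
    return word_id.counter // 10


parsed = parse_word_id("17-b-3790")
print(parsed.image, parsed.side, word_ordinal(parsed))  # 17 b 379
```
</pre>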

<p>Your first task is to use the image number to find the page number of
your print original. If the original had page numbers they will show up
on the digital scan, and finding the page is trivial.  If the original
has no page number, it may take a little ingenuity to find the right
page quickly.</p>

<p>The word counter will tell you whether to look for the word towards
the top, the middle, or the bottom of your page.  A look at the size of
the page and the type font will let you calculate the rough number of
words on a page: typically between 250 and 350, but as many as 1,000 in
the case of double-column folio texts, such as the 1647 edition of
Beaumont and Fletcher. </p>
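<p>The arithmetic behind this estimate is simple enough to sketch in Python.
The sketch assumes that the word counter restarts on each page and that you
supply your own estimate of the number of words on the page. The function name
<code>rough_position</code>, the one-third thresholds, and the page sizes in
the examples are all hypothetical.</p>

<pre>
```python
def rough_position(counter, words_on_page):
    """Guess where on the page a word sits.

    counter is the third part of the word identifier (it steps by ten);
    words_on_page is your own estimate for the page in front of you.
    Assumes the counter restarts on each page.
    """
    fraction = (counter // 10) / words_on_page
    if fraction > 2 / 3:
        return "bottom"
    if fraction > 1 / 3:
        return "middle"
    return "top"


print(rough_position(3790, 400))  # bottom
print(rough_position(1500, 300))  # middle
print(rough_position(500, 300))   # top
```
</pre>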


<p>When you click on the Edit button something else happens:
the labels for the <b>Spelling</b>, <b>Lemma</b>, and <b>POS</b> fields of the Edit panel are
populated with data from the row whose <b>Edit</b>
button you clicked. Those values stay the same until you click another
Edit button and populate the labels with new values. </p>

<h3>4.1 What happens when you correct an error?</h3>

<p>In order to correct an error, you must have a user account and
log in to AnnoLex.  You cannot
create your own user account but must request it from
<a href="mailto:martinmueller@northwestern.edu?Subject=Request%20AnnoLex%20Account">martinmueller@northwestern.edu</a>.
</p>

<p>It is important to have a clear sense of what happens or does
not happen when you  spot an error
and correct it. It is impossible for you to overwrite the original text of the
source. A correction you make is a suggestion that is recorded as a distinct
transaction and passed on to an editor for review and approval. </p>

<p>You can suggest new values for <b>Spelling</b>, <b>Lemma</b>, or
<b>POS</b>, separately or together. You may ignore the values for
<b>Lemma</b> and <b>POS</b>, but do make use of <b>Annotation</b> where
appropriate.  It is a text field that lets you enter free text of any
kind. If you cannot decide on the proper reading of a word, a simple
entry like 'crux' creates a record saying that a word has been looked at
but no solution has been found. That is useful. </p>



<p>Be sure to click the &quot;Save&quot; button to save your correction or annotation.
Clicking the Save button enters a user transaction in a
separate correction table that automatically records:</p>

<ol>
<li>your user id</li>
<li>a time stamp for the transaction</li>
<li>the token id associated with the correction</li>
<li>your suggested new value for the spelling </li>
<li>an optional annotation indicating the rationale for the change</li>
</ol>
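<p>The five recorded items can be pictured as one small record per
transaction. The Python sketch below is only a mental model. It is not the
actual AnnoLex table definition, and the class name
<code>CorrectionRecord</code>, its field names, and the sample values are
hypothetical.</p>

<pre>
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CorrectionRecord:
    """One saved transaction; field names are illustrative only."""
    user_id: str
    token_id: str
    new_spelling: Optional[str] = None
    annotation: Optional[str] = None
    # Recorded automatically at the moment the record is created.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


record = CorrectionRecord(
    user_id="curator1",
    token_id="17-b-3790",
    new_spelling="slaunderous",
    annotation="checked against the printed page",
)
print(record.token_id, record.new_spelling)  # 17-b-3790 slaunderous
```
</pre>

<p>Because the record is frozen, a later change of mind would be a new
record and not an edit of the old one, which fits the description of each
correction as a distinct transaction.</p>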

<p>You can see a record of that transaction if you switch from
the <b>Correct</b> to the <b>Review</b> view.  There you see a curation log, and its
five-column table tells you something about the workflow of AnnoLex. The first
column shows the correction with the original text in strike-through mode and
the replacement in bold.  You see
who made the correction and when. You see whether the correction has been
approved and by whom.  You also see
whether the approved correction has been &quot;applied,&quot; that is, incorporated into the source text.</p>

<h3>4.2 Three types of curation: Update, Insert, Delete</h3>

<p>In the lower left corner of the <b>Edit</b> panel you see a drop-down selector that lets you choose from three values:
<code>Update</code>, <code>Insert</code>, and <code>Delete</code>. They refer to
three different modes of curation. The default setting is <code>Update</code>. Keep in mind that this selector sets a
mode of operation, but does not perform any action itself.  It is the <b>Save</b> button that executes the operation.</p>

<h4>4.2.1 Update</h4>

<p>In an Update operation you change the value of a spelling, lemma, or POS tag, but this
change does not affect the sequence of tokens in the underlying text.  Changing &quot;staunderous&quot; to
&quot;slaunderous&quot; is an update. This is by far the most common form of
curation, and it is a simple, single-step operation. </p>

<h4>4.2.2 Insert and Delete</h4>

<p>Words that are wrongly joined or split are quite common in Early
Modern drama texts. Most of the cases were caught and fixed in the
first-phase curation. The current AnnoLex procedures for insert and
delete operations are cumbersome and error-prone. If you come across
wrongly joined or split words, the simplest thing to do is to leave a
note in the annotation field saying 'wrongly joined' or 'wrongly
split'.</p>

<h3>4.3 <i>A digression about long &quot;s&quot;</i></h3>

<p>In some TCP texts long &quot;s&quot; is transcribed as such.
In others it is recorded as an ordinary &quot;s.&quot; Practice is consistent
within a text but inconsistent across the corpus. The AnnoLex data tables show
all forms of &quot;s&quot; in the normalized form. But in the texts from which
those tables are derived long &quot;s&quot; is preserved if it was transcribed
in the first place. Use the modern &quot;s&quot; in all your corrections. There
will be ways of adjusting spellings for those texts originally encoded with
long &quot;s.&quot; </p>
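<p>If you copy a reading from a source that preserves long &quot;s&quot;,
you can normalize it before entering it. The following Python sketch shows
one way to do that. The function name <code>modernize_long_s</code> is
hypothetical and is not an AnnoLex feature.</p>

<pre>
```python
LONG_S = "\u017f"  # the character ſ (LATIN SMALL LETTER LONG S)


def modernize_long_s(spelling):
    """Replace every long s with an ordinary s."""
    return spelling.replace(LONG_S, "s")


print(modernize_long_s("\u017flaunderous"))  # slaunderous
```
</pre>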

<a name="5_What_AnnoLex_cannot_do"></a><h2>5. What AnnoLex cannot do</h2>

<p>AnnoLex is a tool for the collaborative curation of the most common
errors in TCP texts. It does not help with missing paragraphs or pages
for several reasons, including the simple fact that the page images may
be missing. The transcribers of TCP texts were instructed to ignore
passages written in other alphabets, mainly Greek or Hebrew.
Transcription of such passages requires specialized
palaeographical skills. The same is true of texts with a lot of
mathematical, scientific, or musical notation. Successful curation of
texts with such passages probably requires an approach in which you
begin with a census of what is missing and seek to match texts with
interested and qualified curators.  Such "brokering" is likely to be
more important than the choice of a particular curation tool.</p>

<p>The TCP transcriptions include some, but not very many, errors
in TEI-encoding.  Prose and verse
are not always encoded in &lt;p&gt; or &lt;l&gt; tags, a stage direction may be
tagged as a speaker label, and so forth.  AnnoLex is of no help for such cases.  There are important forms of curation
that go beyond the correction of  obvious errors. For instance, only about half of Early Modern plays have
cast lists and are clearly divided into acts and scenes.  AnnoLex is not a proper tool for the enrichment
of texts with metadata, although it is an excellent tool for the review and
correction of algorithmically supplied linguistic metadata. </p>

<p>In summary, AnnoLex is quite good at what it does, but what it
does is quite limited. However, and to repeat an earlier point: if the
scholarly community that works with Early Modern data took to collaborative and
dispersed curation, using AnnoLex to fix all or most of the little things that
can be fixed with it, many of the EEBO-TCP texts would be in much better shape
and, excepting texts with sizable <i>lacunae</i>,
most of them would be good enough for most purposes. And the great thing about
&quot;good enough&quot; is that it is good enough. </p>



</td></tr>
</table>
</div>
</div>

{% endblock %}
